In our previous work, we analysed the factors that drive reduced prediction accuracy of polygenic scores for height in individuals with African ancestry.
We saw that the site frequency spectrum (SFS) and linkage disequilibrium (LD) play a role, but there is also suggestive evidence that marginal effect sizes differ between ancestries.
In that study we ran a GWAS in ~8,000 individuals with African ancestry from the UKBB and tested for differences between their marginal effect sizes and European-derived effect sizes, as well as for correlations of those differences with allele frequency differences. Finally, we implemented ancestry-informed PRSs in the admixed individuals, and observed only a very modest improvement in prediction accuracy.
The modest improvement may have been due to our small sample size. Here we use a much larger sample (about 58K African ancestry individuals and 91K individuals in total) to explore the potential of ancestry-informed PRSs for height. We also try a larger meta-analysis, with the same 58K African ancestry individuals and about 726K individuals in total.
For now, we are focusing on height only.
We use GWAS summary statistics for height from four sources:
*Uganda Genome Project (UGP): a meta-analysis of Uganda plus 3 other populations from Africa;
*pan-UKBB (both the ‘African’ subset and the entire set): a height GWAS done separately for each subpopulation in the UK Biobank;
*N’diaye et al. 2011: still the largest height GWAS performed in African ancestry individuals;
*PAGE: a large meta-analysis in which 35% of participants are African American and the remaining participants are mostly of Hispanic/Latino and other minority ancestries.
Most summary statistics were on the hg19 build; the exception was N’diaye et al., which we lifted over from hg18 to hg19. Each study had already applied its own filtering, and there is often not enough information for us to perform our own.
UGP: this is very recent. They filtered for imputation score > 0.3.
pan-UKBB: They filtered for INFO scores > 0.8 and a minimum allele count of 20 in each population. They also provide a True/False flag, “low_quality_AFR”; we retained only variants for which it is ‘false’. We used this filter in both the meta-AFR and meta-ALL meta-analyses. GWAS covariates: Age, Sex, Age*Sex, Age^2, Age^2*Sex, and the first 10 PCs. Height (in cm) was inverse-normal transformed.
N’diaye et al.: The genomic control (GC) inflation factor was calculated for each study and used for within-study correction prior to the meta-analysis. The overall lambda they report is 1.064 (which we confirm, see table below), suggesting no substantial inflation in this meta-analysis. Imputation info scores are not available, but the authors filtered for >= 0.3. Betas and SEs are in units of z-score.
PAGE: The phenotype was inverse-normal-adjusted residuals for each trait outcome. Info scores are available; the authors filtered for > 0.4, and we applied a stricter filter of > 0.8.
We performed two meta-analyses:
*meta-AFR: UGP + pan-UKBB (AFR) + N’diaye et al. 2011 meta-analysis + PAGE project. Total of 91,028 individuals (58,488 of African ancestry). See Table 1.
*meta-ALL: UGP + pan-UKBB (all ancestries) + Biobank Japan (BBJ) + N’diaye et al. 2011 meta-analysis + PAGE project. Total of 726,320 individuals (58,488 of African ancestry). See Table 1.
Note that both include the same number of African ancestry individuals. We performed meta-ALL to check whether a larger total sample size, even if the increase does not come from African ancestry individuals, would change predictions.
We ran each meta-analysis with METAL, using one input file per dataset listed above. We set genomic control correction to “ON”, meaning it is applied to each input file (not to the final values). We used SCHEME STDERR, meaning that betas and standard errors are combined. For the meta-AFR analysis, we set AVERAGEFREQ and MINMAXFREQ to “ON” so that METAL tracks large allele frequency differences across datasets, which can suggest allelic mismatch. We only report results for variants with a combined weight of at least 45,000 (meta-AFR) or 360,000 (meta-ALL) individuals, resulting in about 32.5 million autosomal variants in both datasets.
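For readers unfamiliar with the standard-error scheme, the following is a minimal Python sketch of a fixed-effects, inverse-variance-weighted combination of per-study estimates for a single variant, which is the kind of estimate a standard-error-based meta-analysis produces. It is an illustration only, not the pipeline used here: the function name and the toy numbers are hypothetical, and it assumes all betas are already aligned to the same effect allele.

```python
import math
from statistics import NormalDist


def inverse_variance_meta(betas, ses):
    """Combine per-study effect estimates for one variant (fixed effects).

    betas: per-study effect sizes, aligned to the same effect allele.
    ses:   per-study standard errors (all > 0).
    Returns the pooled beta, its standard error and a two-sided p-value.
    """
    weights = [1.0 / se ** 2 for se in ses]          # weight = 1 / variance
    total_weight = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    z = beta / se
    p = 2.0 * NormalDist().cdf(-abs(z))              # two-sided normal p-value
    return beta, se, p


# Toy example: two studies with equal precision.
pooled_beta, pooled_se, pooled_p = inverse_variance_meta([0.1, 0.3], [0.1, 0.1])
```

With equal standard errors the pooled beta is the simple average of the two inputs, and the pooled standard error shrinks by a factor of the square root of 2.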
We inspected the p-value distribution of these meta-analyses using QQ-plots and calculated the genomic inflation on the final p-values, and performed corrections accordingly.
Fig 1: QQ-plot for meta-AFR. P-values were GC corrected within each dataset, but not in the final meta-AFR analysis. Based on the lambda value shown here, downstream analyses used corrected final p-values.
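As a rough illustration of how a genomic inflation factor can be computed from final p-values, the sketch below converts each p-value to a 1-df chi-square statistic, takes the median, and divides by the expected median under the null. The correction step (dividing each chi-square statistic by lambda, and only when lambda exceeds 1) is the usual genomic control convention and is an assumption here, since the exact correction applied is not spelled out above. All function names are hypothetical.

```python
import math
from statistics import NormalDist, median

STD_NORMAL = NormalDist()  # standard normal: mean 0, SD 1
# Median of a 1-df chi-square distribution (about 0.455), obtained as the
# square of the 75th percentile of the standard normal.
CHI2_1DF_MEDIAN = STD_NORMAL.inv_cdf(0.75) ** 2


def chi2_from_p(p):
    """Two-sided p-value in (0, 1] -> 1-df chi-square statistic."""
    z = STD_NORMAL.inv_cdf(p / 2.0)
    return z * z


def p_from_chi2(chi2):
    """1-df chi-square statistic -> two-sided p-value."""
    return 2.0 * STD_NORMAL.cdf(-math.sqrt(chi2))


def genomic_inflation(p_values):
    """Lambda: observed median chi-square over its expected null median."""
    return median(chi2_from_p(p) for p in p_values) / CHI2_1DF_MEDIAN


def gc_correct(p_values):
    """Divide every chi-square by lambda (assumed: never deflate if lambda < 1)."""
    lam = max(genomic_inflation(p_values), 1.0)
    return lam, [p_from_chi2(chi2_from_p(p) / lam) for p in p_values]
```

Applied to uniformly distributed p-values, `genomic_inflation` returns a value close to 1; values well above 1 indicate inflated test statistics.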
Fig 2: Manhattan plot for meta-AFR GWAS (top) and UKBB_eur (bottom, for comparison). The genome-wide significance line is plotted at P=1e-08. SNPs above that threshold are plotted in red (49 and 3,308 SNPs for meta-AFR and UKBB_eur, respectively). SNPs above the threshold in both the meta-AFR and UKBB_eur GWAS are plotted in orange (8 SNPs). SNPs were clumped prior to plotting (P<=5e-04, 100Kb window).
Genotype data from test cohorts was lifted over to hg38 when needed.
PMBB (Penn Biobank): with sets of EUR (7,501) and AFR (9,226) ancestry individuals.
UKB_CHI (UKBB Chinese): a set of 1,504 individuals with Chinese ancestry from the UK Biobank.
HRS (Health and Retirement Study): with sets of EUR (10,486) and AFR (2,322) ancestry individuals.
We visually inspected QQ-plots of height residuals for each dataset to check for extreme outliers. Based on this inspection, we restricted PMBB (Figs 3-4 for before and after filtering) and HRS (Figs 5-6 for before and after filtering) samples to those whose residual height was within \(\pm3\) standard deviations of the mean for each sex. For UKB_CHI, no filtering was necessary (Fig 7). Height residuals for each individual were obtained by regressing height on all covariates and their interactions:
\[height\sim Sex+Age+Age^2+Sex*Age+Sex*Age^2+p_{EUR}+Sex*p_{EUR}+Age*p_{EUR}+Age^2*p_{EUR}\]
where \(p_{EUR}\) is the genome-wide average proportion of European ancestry for PMBB_afr and HRS_afr (estimated with RFMix), and the European ancestry component (estimated with unsupervised ADMIXTURE, K=2) for UKB_CHI. For HRS_eur and PMBB_eur, we set \(p_{EUR}\) to 1.
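The outlier filter described above can be sketched as follows, assuming residuals have already been obtained from the regression. This is an illustration rather than the code actually used: the function name and record layout are hypothetical, and the use of the sample standard deviation is an assumption.

```python
from collections import defaultdict
from statistics import mean, stdev


def filter_residual_outliers(records, max_sd=3.0):
    """Keep samples whose residual is within max_sd SDs of their sex-specific mean.

    records: list of (sample_id, sex, residual) tuples; each sex needs at
             least two samples so that a standard deviation can be computed.
    """
    by_sex = defaultdict(list)
    for _, sex, residual in records:
        by_sex[sex].append(residual)

    # Sex-specific mean and sample SD (assumed; a population SD would differ slightly).
    stats = {sex: (mean(vals), stdev(vals)) for sex, vals in by_sex.items()}

    kept = []
    for sample_id, sex, residual in records:
        mu, sd = stats[sex]
        if abs(residual - mu) <= max_sd * sd:
            kept.append((sample_id, sex, residual))
    return kept
```

Note that the mean and SD are computed once on the unfiltered data; an iterative variant that recomputes them after each removal would be a different design choice.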
When multiple time points were available for each individual, we retained the one corresponding to the latest height measure and age. All height phenotype data was formatted to be in centimeters.
Each test cohort was randomly divided into a “train” and a “test” set at a ratio of 0.15 (train) to 0.85 (test) for most datasets, except for UKB_CHI and HRS_afr, where we used 0.20:0.80 (Table 2). We performed a stratified split of the data using the initial_split function from the rsample R package, with ‘Sex’ as the stratum, i.e., to keep Sex proportions similar between the training and testing sets (Table 2).
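To make the idea of a stratified split concrete, here is a small Python sketch that samples the training fraction separately within each stratum. It is not the rsample implementation, only an illustration of the principle; the function name, the seed handling and the rounding rule are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(sample_ids, strata, train_fraction, seed=0):
    """Split samples into train/test while preserving stratum proportions.

    sample_ids:     list of sample identifiers.
    strata:         list of stratum labels (e.g. sex), same length as sample_ids.
    train_fraction: fraction of each stratum assigned to the training set.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_stratum = defaultdict(list)
    for sample_id, stratum in zip(sample_ids, strata):
        by_stratum[stratum].append(sample_id)

    train, test = [], []
    for stratum in sorted(by_stratum):
        members = by_stratum[stratum][:]
        rng.shuffle(members)
        n_train = round(len(members) * train_fraction)
        train.extend(members[:n_train])
        test.extend(members[n_train:])
    return train, test


# Toy example: 120 females and 80 males, 15% assigned to training.
ids = list(range(200))
sex = ["F" if i < 120 else "M" for i in ids]
train_ids, test_ids = stratified_split(ids, sex, train_fraction=0.15)
```

Because sampling happens within each stratum, the female-to-male ratio in the toy training set matches that of the full cohort.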
We used LDpred for PRS calculations. For UKBB_eur summary statistics, we used UKBB_eur as the LD reference panel; for BBJ and meta-AFR we used East Asians and Africans from 1000G Phase 3, respectively. We first ran ldpred coord to coordinate the summary statistics, test and LD datasets. Next, we ran the Gibbs sampler. Many values of p (the assumed fraction of causal variants) did not converge, but p=1 and p=0.3 typically did, so we looked at those, as well as the infinitesimal model. See Table
PRS_eur: PRS using effect sizes (\(\beta\)) from UKBB_eur.
PRS_eas: PRS using effect sizes (\(\beta\)) from BBJ (all East Asian).
PRS_afr: PRS using effect sizes (\(\beta\)) from the meta-AFR GWAS.
PRS_all: PRS using effect sizes (\(\beta\)) from the meta-ALL GWAS.
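Each of the scores listed above is, at its core, a weighted sum of effect-allele dosages. The sketch below shows only that scoring arithmetic for one individual; it does not reproduce LDpred's reweighting of effect sizes, and the function name and variant identifiers are hypothetical.

```python
def polygenic_score(dosages, weights):
    """Weighted sum of effect-allele dosages for one individual.

    dosages: dict mapping variant ID -> effect-allele dosage (0 to 2).
    weights: dict mapping variant ID -> per-allele effect size (beta).
    Variants missing from either input are skipped.
    """
    shared = dosages.keys() & weights.keys()
    return sum(weights[v] * dosages[v] for v in shared)


# Toy example with hypothetical variant IDs.
weights = {"rs1": 0.5, "rs2": -0.2, "rs3": 0.1}
dosages = {"rs1": 2, "rs2": 1, "rs4": 2}
score = polygenic_score(dosages, weights)  # uses rs1 and rs2 only
```

Only the choice of weights changes between PRS_eur, PRS_eas, PRS_afr and PRS_all; the dosages come from the same test individual.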
\[height\sim Sex+Age+Age^2+p_{EUR}\]
\[height\sim Sex+Age+Age^2+p_{EUR}+PRS_{eur}\]
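The two models above differ only in the PRS term, so the variance attributable to the PRS can be summarised as a partial R2. The sketch below assumes the standard definition (the proportional reduction in residual sum of squares when moving from the covariate-only model to the full model); the exact definition used for the figures is not stated here, and the function names are hypothetical.

```python
def partial_r2_from_sse(sse_base, sse_full):
    """Partial R2 from residual sums of squares of the base and full models."""
    return 1.0 - sse_full / sse_base


def partial_r2_from_r2(r2_base, r2_full):
    """Equivalent form using each model's ordinary R2."""
    return (r2_full - r2_base) / (1.0 - r2_base)


# Toy example: total sum of squares 100, base model leaves 60, full model leaves 45.
from_sse = partial_r2_from_sse(60.0, 45.0)   # 0.25
from_r2 = partial_r2_from_r2(0.40, 0.55)     # approximately 0.25
```

Both forms give the same value because each model's R2 equals one minus its residual sum of squares divided by the shared total sum of squares.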
PRS1_ML (described in Marquez-Luna et al. 2017 and Bitarello & Mathieson 2020)
PRS2_BM - linear combination of PRS described in Bitarello & Mathieson 2020
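As a generic illustration of tuning a two-component linear combination, the sketch below mixes two scores as alpha * PRS_anc2 + (1 - alpha) * PRS_anc1 and picks the alpha that maximises the squared correlation with the phenotype in a training set. This form is an assumption chosen so that alpha=0 corresponds to the first component alone; it does not reproduce the specific definitions of PRS1_ML or PRS2_BM, which are given in the cited papers. The grid search, the function names and the assumption that both scores are on comparable scales are all hypothetical.

```python
from statistics import mean


def pearson_r2(x, y):
    """Squared Pearson correlation between two equal-length, non-constant lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def combine(prs_anc1, prs_anc2, alpha):
    """Linear combination: alpha weights the second component."""
    return [alpha * b + (1.0 - alpha) * a for a, b in zip(prs_anc1, prs_anc2)]


def tune_alpha(prs_anc1, prs_anc2, phenotype, n_grid=101):
    """Grid search over alpha in [0, 1]; returns (best_alpha, best_r2)."""
    best_alpha, best_r2 = 0.0, -1.0
    for i in range(n_grid):
        alpha = i / (n_grid - 1)
        r2 = pearson_r2(combine(prs_anc1, prs_anc2, alpha), phenotype)
        if r2 > best_r2:
            best_alpha, best_r2 = alpha, r2
    return best_alpha, best_r2
```

In practice alpha would be tuned on the training split and the resulting combination evaluated on the held-out test split, so that the reported accuracy is not inflated by the tuning step.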
Fig 3: Partial-R2 for height PRS for different test cohorts and using different summary statistics and LDpred models
## the winner for pmbb_afr is:
##    Alpha    Test    Train      Model R_sq Anc2 Anc1 Base_Rsq R_sq_fold
## 1: 0.695 PRS2_BM pmbb_afr LDpred-Inf 3.32  afr  eur     2.01  1.651741
## the winner for hrs_afr is:
##    Alpha    Test   Train      Model R_sq Anc2 Anc1 Base_Rsq R_sq_fold
## 1: 0.574 PRS1_ML hrs_afr LDpred-Inf 2.68  afr  eur     1.25     2.144
## the winner for chi is:
##    Alpha    Test Train      Model R_sq Anc2 Anc1 Base_Rsq R_sq_fold
## 1: 0.336 PRS2_BM   chi LDpred-Inf 4.99  eas  eur     3.62  1.378453
Fig 4: Linear combinations of PRS for PMBB_afr. PRS1_ML, described in Marquez-Luna et al. 2017 and Bitarello & Mathieson 2020. PRS2_BM, described in Bitarello & Mathieson 2020. Fold increase relative to alpha=0. Anc2 is the ancestry of the second PRS component.
Fig 5: Linear combinations of PRS for PMBB_afr. PRS1_ML, described in Marquez-Luna et al. 2017 and Bitarello & Mathieson 2020. PRS2_BM, described in Bitarello & Mathieson 2020. Absolute values. Anc2 is the ancestry of the second PRS component.
Fig 6: Linear combinations of PRS for HRS_afr. PRS1_ML, described in Marquez-Luna et al. 2017 and Bitarello & Mathieson 2020. PRS2_BM, described in Bitarello & Mathieson 2020. Fold increase relative to alpha=0. Anc2 is the ancestry of the second PRS component.
Fig 7: Linear combinations of PRS for HRS_afr. PRS1_ML, described in Marquez-Luna et al. 2017 and Bitarello & Mathieson 2020. PRS2_BM, described in Bitarello & Mathieson 2020. Absolute values. Anc2 is the ancestry of the second PRS component.
Fig 8: Linear combinations of PRS for UKB_CHI. PRS1_ML, described in Marquez-Luna et al. 2017 and Bitarello & Mathieson 2020. PRS2_BM, described in Bitarello & Mathieson 2020. Fold increase relative to alpha=0. Anc2 is the ancestry of the second PRS component.
Fig 9: Linear combinations of PRS for UKB_CHI. PRS1_ML, described in Marquez-Luna et al. 2017 and Bitarello & Mathieson 2020. PRS2_BM, described in Bitarello & Mathieson 2020. Absolute values. Anc2 is the ancestry of the second PRS component.
Conclusions:
Optimized tuning occurs for \(\alpha=0.695\) for both PRS1 and PRS2 using ‘AFR’ as the second ancestry.
For PRS1_ML: Under these parameters, we see a 1.62- and 1.65-fold increase from PRS_eur and PRS_afr, respectively.
For PRS2_BM: Under these parameters, we see a 1.65- and 1.15-fold increase from PRS_eur and PRS_afr, respectively.
[ongoing…]